Introduction

Defining problem statement

Knowing house prices is important to both home buyers and sellers: each party wants the best deal, so the price can be neither too high nor too low. Banks also perform due diligence so that they do not finance an overvalued house. It is therefore imperative for both buyer and seller to obtain an appraisal that is acceptable to all parties in the transaction. A number of factors affect home prices: the economy, the number of similar houses sold in the area in the recent past, and the features of the house.

Data from house sales across several communities is vital for building machine learning models that can predict house prices. Such models would inform financial institutions about when to send refinance offers to homeowners, and would give buyers and sellers a good estimate of a home's value, so that they need not pay for a professional appraisal unless the bank demands one.

Data Dictionary

● cid: a unique identifier (notation) for each house

● dayhours: Date house was sold

● price: price of the house (prediction target, in $)

● room_bed: Number of Bedrooms per house

● room_bath: Number of bathrooms per bedroom

● living_measure: square footage of the home

● lot_measure: square footage of the lot

● ceil: Total floors (levels) in house

● coast: House which has a view to a waterfront (0 - No, 1 - Yes)

● sight: How many times the house has been viewed

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

● condition: How good the condition is (Overall out of 5)

● quality: grade given to the housing unit, based on grading system

● ceil_measure: square footage of house apart from basement

● basement_measure: square footage of the basement

● yr_built: Year the house was built

● yr_renovated: Year when house was renovated

● zipcode: zip code

● lat: Latitude coordinate

● long: Longitude coordinate

● living_measure15: Living room area in 2015 (implies some renovations); this may or may not have affected the lot size

● lot_measure15: Lot size area in 2015 (implies some renovations)

● furnished: Whether the house is furnished, based on the quality of the rooms (0 - No, 1 - Yes)

● total_area: Combined measure of the living and lot areas

The dataset contains missing values.

Data was collected daily between May 2, 2014 and May 27, 2015. The time of day appears to be the same for every record, or was not considered relevant.

The data has 21,613 rows and 23 columns.

Note: drop cid, since it is only an identifier and carries no predictive information.
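Dropping the identifier can be sketched as follows; the real CSV path is not given in the report, so a small synthetic frame stands in for the actual data:

```python
import pandas as pd

# Stand-in for the housing data: in the real notebook this frame comes
# from reading the CSV. cid only identifies a row, so it is dropped.
df = pd.DataFrame({
    "cid": [101, 102, 103],
    "price": [450000, 620000, 310000],
    "room_bed": [3, 4, 2],
})

df = df.drop(columns=["cid"])
print(df.columns.tolist())  # cid is gone
```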

Price is right-skewed.

The average number of bedrooms is 3.

Houses have 1 to 4 bathrooms, with an average of about 2.

Living measure (the square footage of the home) is right-skewed.

The average square footage of the home is approximately 2,100, and the maximum is 6,000.

The lot measure is right-skewed.

The quality (grade) given to the houses ranges from 1 to 12, with an average of about 7.7 and a median of 7.

histogram_boxplot(df, "ceil_measure")

Ceil measure is right-skewed, with an average of about 1,800 square feet.

labeled_barplot(df, "dayhours", n = 20)

Most houses have one or two floors.

About 99.1 percent of the homes are in non-coastal locations.

I will consider correlations greater than 0.5 as significant.

Factors highly correlated with price are:

furnished, living measure, ceil measure, quality, l

No feature was negatively correlated with price.
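A sketch of how such a correlation screen might look, using synthetic stand-in data and the 0.5 cutoff stated above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data: living area drives price,
# the build year does not.
rng = np.random.default_rng(0)
living = rng.normal(2100, 600, 500)
price = 150 * living + rng.normal(0, 30000, 500)
df = pd.DataFrame({
    "price": price,
    "living_measure": living,
    "yr_built": rng.integers(1900, 2015, 500),
})

# Correlation of every numeric feature with price; keep |r| > 0.5.
corr = df.corr(numeric_only=True)["price"].drop("price")
strong = corr[corr.abs() > 0.5]
print(strong)
```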

Plotting values that are highly correlated with price

The more times a property has been viewed (sight), the higher the price.

Prices do not change significantly with the age of the building.

Data Engineering

New missing values were found in coast and long.

room_bed/room_bath

No missing values
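One plausible way to treat the missing values in coast and long, sketched on a toy frame (mode for the binary flag, median for the coordinate; the actual notebook may use a different strategy):

```python
import numpy as np
import pandas as pd

# Toy frame with the two columns that had missing values.
df = pd.DataFrame({
    "coast": [0, 1, np.nan, 0],
    "long": [-122.25, np.nan, -122.10, -122.30],
})

# Binary flag: fill with the most frequent value (mode).
df["coast"] = df["coast"].fillna(df["coast"].mode()[0])
# Continuous coordinate: fill with the median, which is robust to outliers.
df["long"] = df["long"].fillna(df["long"].median())

print(df.isna().sum())
```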

Data Preprocessing (contd.)

Outlier Detection

Outlier Treatment
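A common treatment, assumed here rather than taken from the notebook, is IQR-based capping (winsorizing), sketched on a toy series:

```python
import pandas as pd

# Toy lot sizes with one extreme outlier.
s = pd.Series([1200, 1500, 1800, 2100, 2400, 15000])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) rather than drop, so no rows are lost.
capped = s.clip(lower, upper)
print(capped.max())
```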

The log of price is much closer to a symmetric, approximately normal distribution.
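The effect of the log transform on skew can be sketched with toy prices:

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices (one expensive house pulls the mean up).
price = pd.Series([221900, 538000, 180000, 604000, 510000, 1225000, 257500])

# The log transform compresses the long right tail.
log_price = np.log(price)
print(round(price.skew(), 2), round(log_price.skew(), 2))
```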

The distribution of living measure is right skewed.

The areas with zip codes starting with 980 have about 25% of houses furnished; those starting with 981 have only about 10% furnished.

Linear Model Building

  1. We want to predict the log of price of houses.

  2. Before we proceed to build a model, we'll have to encode categorical features.

  3. We'll split the data into train and test to be able to evaluate the model that we build on the train data.

  4. We will build a Linear Regression model using the train data and then check its performance.
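Steps 2-4 above can be sketched as follows; the feature names mirror the data dictionary, but the data itself is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "living_measure": rng.normal(2100, 600, 300),
    "quality": rng.integers(1, 13, 300),
    "zipcode": rng.choice(["98001", "98101"], 300),  # categorical
})
y = np.log(150 * X["living_measure"] + 5000 * X["quality"]
           + rng.normal(0, 5000, 300) + 50000)       # log of price

# Step 2: one-hot encode categoricals; step 3: train/test split; step 4: fit.
X_enc = pd.get_dummies(X, columns=["zipcode"], drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_enc, y, test_size=0.3, random_state=1
)

lr = LinearRegression().fit(X_train, y_train)
print(round(lr.score(X_test, y_test), 3))  # R^2 on the held-out set
```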

Let's check the performance of the model using different metrics.
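A small helper that computes the usual regression metrics in one place; the function name is hypothetical, not the notebook's:

```python
import numpy as np

def model_performance(y_true, y_pred):
    """Return RMSE, MAE, R-squared, and MAPE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mape = np.mean(np.abs(err / y_true)) * 100  # percent error
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "MAPE": mape}

print(model_performance([100, 200, 300], [110, 190, 310]))
```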

Observations

Linear Regression using statsmodels

Observation

Checking Linear Regression Assumptions

We will be checking the following Linear Regression assumptions:

  1. No Multicollinearity

  2. Linearity of variables

  3. Independence of error terms

  4. Normality of error terms

  5. No Heteroscedasticity

Removing Multicollinearity

To remove multicollinearity

  1. One by one, drop each column that has a VIF score greater than 5 and build a model without it.
  2. Look at the adjusted R-squared and RMSE of all these models.
  3. Permanently drop the variable whose removal causes the least change in adjusted R-squared.
  4. Check the VIF scores again.
  5. Continue until all VIF scores are under 5.

Let's define a function that will help us do this.
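A simplified sketch of such a function: it drops the highest-VIF column each round rather than the least-adjusted-R-squared-change column described above, but the stopping rule (all VIFs under 5) is the same. Demonstrated on two nearly collinear columns plus one independent column:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=5.0):
    """Iteratively drop the highest-VIF column until all VIFs <= threshold."""
    X = X.copy()
    while True:
        vif = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vif.max() <= threshold or X.shape[1] == 1:
            return X
        X = X.drop(columns=[vif.idxmax()])

rng = np.random.default_rng(1)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(0, 0.01, 200),  # near-duplicate of a
    "b": rng.normal(size=200),               # independent
})
reduced = drop_high_vif(X)
print(reduced.columns.tolist())  # one of the duplicates is gone
```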

The above predictors have no multicollinearity and the assumption is satisfied.

Let's check the model performance.

Observations

The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

Test for Linearity and Independence

Why the test?

How to check linearity and independence?

How to fix if this assumption is not followed?

Conclusion

Building Alternative Models

Random Forest

On train data

On test
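The train-versus-test comparison for a tree ensemble can be sketched as follows (synthetic data; a large gap between the two scores signals overfitting):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 400)  # nonlinear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Compare train vs test R^2: forests usually fit train data almost perfectly.
print(round(rf.score(X_tr, y_tr), 3), round(rf.score(X_te, y_te), 3))
```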

Decision Tree

On train data

On test data

XGBoost

On train data

On test data

We will cross-check the model scores for XGBoost.

Ridge and Lasso

Ridge on train set

Ridge on test set

Lasso on train set

Lasso on test set
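A side-by-side sketch of Ridge and Lasso on synthetic data; the alpha values are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))
# Two of the five true coefficients are exactly zero.
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.5, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
lasso = Lasso(alpha=0.1).fit(X_tr, y_tr)

# Ridge shrinks all coefficients; Lasso can zero out the irrelevant ones.
print("ridge test R2:", round(ridge.score(X_te, y_te), 3))
print("lasso test R2:", round(lasso.score(X_te, y_te), 3))
print("lasso zeroed coefs:", int((lasso.coef_ == 0).sum()))
```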

Conclusion